Reproducible walk-through of codes and data for Buphamalai et.al.
This supplementary report is aimed to be a reproducible walk-through guide for figures and analyses complementing the manuscript: Buphamalai et.al., Network analysis reveals rare disease signatures across multiple levels of biological organization, submitted to Nature Communications. PLease find the following guidelines:
Show code.R, sh, or py scripts required for each analysis are mentioned for each section. These files are available in ./source folder../cache folder and can be downloaded from: linkRmd files used to produce this report can be found in ./report/ folder. The report mainly focuses on reproducing results shown in the manuscripts. For explanation, discussion, and references, please refer to the main manuscript.knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
library(pacman)
p_load(patchwork, igraph, tidyverse, cowplot, rmarkdown)
# compute different network properties
if(!file.exists("../cache/network_complementarity_topological.RDS")){
# load required functions
source("../source/network_properties_analysis.R")
} else{
print("Load precomputed data")
g_prop_df <- readRDS("../cache/network_complementarity_topological.RDS")
}
[1] "Load precomputed data"
network_details <- read_tsv("../data/network_details.tsv", col_types = 'ccccc')
45 Network layers from six major databases were constructed as detailed below:
paged_table(network_details %>% select(!type))
In addition to the scale comprehensiveness, these networks are also topologically complementary. A number of key network properties including node and link coverage, modularity, assortativity, and social bias, have been compared and shown below.
Social bias: many networks were constructed based on curation from literatures. The social bias of a network is assessed by the Spearman’s correlation coefficient between the network degree of a gene and the number of publications mentioning the gene. The number of publications was queried using the INDRA python module (http://www.indra.bio, accessed on 12 April 2019)
Note that, for the co-expression layer, the values showed on the table above are averaged from all 38 tissue-specific networks.
The plot below summarises the table properties.
# create a list of plots to patch together
plot = list()
for(prop in unique(g_prop_df$property)){
plot[[prop]] = g_prop_df %>%
arrange(group) %>% filter(property == prop) %>%
ggplot( aes(x=group, y=as.numeric(value))) +
geom_segment( aes(x=group, xend=group, y=0, yend=value), color="grey80", size=1.5) +
geom_violin(fill="#F8B100", alpha = 0.4, color = NA) +
geom_point( aes(color=alphaval), size=4, alpha=0.6) +
theme_light() +
coord_flip() +
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank(),
# axis.text.y = element_blank(),
) +
guides(color = F)+
xlab("") +
scale_color_manual(values = c("#F8B100",NA)) +
ylab(prop)
# scale y log for some properties (n edges)
if(prop %in% c("Number of edges")){
plot[[prop]] = plot[[prop]] + scale_y_log10()
}
# for the first plot, allows axis label
if(!prop %in% c("Number of nodes")){
plot[[prop]] = plot[[prop]] + theme(axis.text.y = element_blank())
}
}
plot_combine = plot$`Number of nodes` + plot$`Edge density` + plot$`Global clustering` + plot$Assortativity + plot$`Social bias` + plot_layout(nrow = 1)
# uncomment to save the plot as pdf
#ggsave("../Figs/network_properties_characterisation.pdf", plot_combine, height = 2.5, width = 9)
suppressWarnings(print(plot_combine))
We quantified the similarity of a given pair of networks \(g_A \in G(V_A, E_A)\) and \(g_B \in G(V_A, E_A)\) using the edge overlap index: \[S_{AB}=\dfrac{|E_A \cap E_B|}{\text{min}(|E_A|,|E_B|)}\] We used a dissimilarity measure defined as \(d_{AB} = 1 - S_{AB}\) to construct a 2D map \(\mathbf{X} \subset \mathbb{R}^{2}\) that preserves network dissimilarities by employing Kruskal’s non-metric multidimensional scaling (R package MASS) 75. Finally, we compared the measured similarity of each network pair to random expectation: For each network, we performed 10 permutations of node indices, resulting in 100 permutations for a network pair which we used as random reference distribution to assess the measured overlap similarity. We then computed \(z\)-score and the corresponding empirical \(p\)-value. A network pair with \(p-\)value < 0.05 is considered significant.
The MDS plot derived from Jaccard and Overlap Similarity is as follows:
pacman::p_load(ggrepel, MASS)
# load the precomputed data
if(!file.exists("../cache/network_jaccard_overlap_similarity_df.RDS")){
source("../source/compute_jaccard_similarity.R")
} else{
print("load pre-computed network similarity data")
network_sim_df <- readRDS("../cache/network_jaccard_overlap_similarity_df.RDS")
}
[1] "load pre-computed network similarity data"
# turn df to weight symmatrix matrix through graph
g_overlap <- graph_from_data_frame(network_sim_df[,c(1,2,4)] %>%
rename(., weight = overlapindex), directed = F)
sim_overlap <- get.adjacency(g_overlap, attr = "weight")
diag(sim_overlap) = 1
#change similarity to distance
dist_overlap = 1 - sim_overlap
############
# MDS plot normal
#### MDS plot for Kruskal
mds<- isoMDS(as.matrix(dist_overlap), k = 2)
initial value 46.790384
iter 5 value 32.191184
iter 10 value 23.308862
iter 15 value 21.770579
final value 21.704111
converged
# a data frame of MDS values
mds_df = data.frame(x = mds$points[,1], y = mds$points[,2], network = rownames(mds$points))
# add network metadata and node size
mds_df = mds_df %>%
left_join(., g_prop_df %>% dplyr::filter(property=="Number of nodes") %>%
dplyr::select(network, value)) %>%
left_join(., network_details) %>%
dplyr::filter(!is.na(main_type)) %>%
mutate(label = ifelse(!grepl("coex", network), subtype, "")
# collabel = ifelse(!is.na(type), type, subtype)
)
# plot the scatters of all networks
p <- mds_df %>%
ggplot() +
geom_point(aes(x, y, col = main_type, size = value), alpha = 0.5) +
geom_text_repel(aes(x, y, label = label)) +
theme_cowplot() +theme(
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
axis.text.y=element_blank()) +
xlab("MDS1") + ylab("MDS2") +
scale_color_manual(values = c("#F8B100", "#005564"))+
guides(col = F, size = F)
p
#ggsave("../Figs/scatter_Network_complementarity_MDS_Overlap.pdf",plot = p, width = 4, height = 4)
Overlap index:
network_sim_df %>% pull(overlapindex) %>% summary
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0001835 0.0121798 0.0331371 0.0419274 0.0543372 0.4899776
Median overlap indices between co-expression networks are 0.0433192 and non co-expression networks are 0.0183954.
## Jaccard heatmap
## remove coex_core from the map
p1 = network_sim_df[,1:3] %>% dplyr::filter(!grepl("core", V1), !grepl("core", V2)) %>%
# heatmap plot
ggplot() + geom_tile(aes(x = V1, y = V2, fill = jaccardIndex)) + scale_fill_distiller(direction = 1) + xlab("") +ylab("") + ggtitle("Jaccard similarity among all networks") + theme_minimal()+ theme(axis.text.x = element_text(angle = 90))
p1
We use core transcriptional modules to represent all of the co-expression network. The heatmap below shows the Jaccard and Overlap similarity.
## remove coex_core from the map
considered_networks = c("coex_core", "reactome_copathway", "ppi", "MP", "HP", "GOMF", "GOBP")
labels = c("co-expression", "co-pathway", "PPI", "MP", "HP", "GOMF", "GOBP ")
# Jaccard index
#######
p1 = network_sim_df[,1:3] %>% dplyr::filter(V1 %in% considered_networks, V2 %in% considered_networks ) %>%
# rescale factor
mutate(V1 = factor(V1, levels = considered_networks, labels = labels),
V2 = factor(V2, levels = considered_networks, labels = labels)) %>%
# heatmap plot
ggplot() + geom_tile(aes(x = V1, y = V2, fill = jaccardIndex)) + scale_fill_distiller(direction = 1) + xlab("") +ylab("") + ggtitle("Jaccard similarity") + theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
#ggsave("../Figs/heatmap_overlap_index_aggregated_jaccard.pdf", p1, width = 5, height = 4)
# Overlap index
#######
overlap_df <- network_sim_df[,c(1,2,4)] %>% dplyr::filter(V1 %in% considered_networks, V2 %in% considered_networks ) %>%
# rescale factor
mutate(V1 = factor(V1, levels = considered_networks, labels = labels),
V2 = factor(V2, levels = considered_networks, labels = labels))
# heatmap plot
p2 = ggplot(overlap_df) + geom_tile(aes(x = V1, y = V2, fill = overlapindex)) +
scale_fill_distiller(direction = 1) +
xlab("") +ylab("") +
ggtitle("overlap similarity ") +
theme_cowplot() +
theme(axis.text.x = element_text(angle = 45, hjust=1), legend.position = "bottom")
#ggsave("../Figs/heatmap_overlap_index_aggregated_overlap.pdf", p2, width = 4, height = 5)
p1 + p2
[1] "average overlap index for all networks are 0.060363"
For a given pair of network \(g_A \in G(V_A, E_A)\) and \(g_B \in G(V_A, E_A)\), we computed edge similarity through overlap index:
\[S_{AB}=\dfrac{|E_A \cap E_B|}{\text{min}(|E_A|,|E_B|)}\]
Foe each network, we performed 10 permutations of node indices, resulting in 100 permutations for a network pair, wehere we obtained the reference distribution for their similarity. We then computed \(z\)-score and the corresponding empirical \(p\)-value. A network pair with \(p-\)value < 0.05 is considered significant (../source/network_overlap_randomisation.R)
[1] "load the precomputed value from cache"
# network similarity values
network_sim_df <- readRDS("../cache/network_jaccard_overlap_similarity_df.RDS") %>%
dplyr::filter(!grepl("core", V1), !grepl("core" , V2))
# edge counts
ecounts <- readRDS("../cache/network_complementarity_topological.RDS")%>%
dplyr::filter(property=="Number of edges") %>%
pull(value, name = network)
# compute minimum edge size for each pair
network_sim_df$min_ecount <- apply(network_sim_df, 1, function(x) min(ecounts[x[1]], ecounts[x[2]]))
# process jaccard and overlap index
jaccard_index <- lapply(randomisation_overlap_result, function(x) x$intersect/x$union)
overlap_index <- lapply(1:length(randomisation_overlap_result), function(x)
randomisation_overlap_result[[x]][['intersect']]/network_sim_df$min_ecount[x])
# compute mean and sd, zscore and pvalue
network_sim_df <- network_sim_df %>%
mutate(jaccard.mean = sapply(jaccard_index, mean),
jaccard.sd = sapply(jaccard_index, sd),
overlap.mean = sapply(overlap_index, mean),
overlap.sd = sapply(overlap_index, sd),
jaccard.zscore = (jaccardIndex-jaccard.mean)/jaccard.sd,
jaccard.pval = pnorm(jaccard.zscore, lower.tail = F),
overlap.zscore = (overlapindex-overlap.mean)/overlap.sd,
overlap.pval = pnorm(overlap.zscore, lower.tail = F)
)
# label p values into classes
network_sim_df <- network_sim_df %>%
# make p values in groups
mutate(overlap.pval_level = cut(overlap.pval,
breaks = rev(c(1,5e-2, 1e-3, 1e-4, 1e-5, 0)),
labels = rev(c("ns","*","**","***","****")),
include.lowest = T, ordered_result = T),
# compute whether the pair are from both co-expression, or non co-expressions
type1 = grepl("coex", V1),
type2 = grepl("coex", V2),
pair_name = paste(str_remove(V1,"coex_|reactome_"),
str_remove(V2,"coex_|reactome_"), sep = " - "),
# label only if overlap score higher than 0.2
pair_label = ifelse(overlapindex > 0.2, pair_name, ""),
type_pair = factor(type1+type2, levels = 0:2, labels = c("non.coex - non.coex",
"coex - non.coex",
"coex - coex"))) %>%
mutate(overlap.pval_level = factor(overlap.pval_level, levels = rev(levels(overlap.pval_level))))
network_sim_df %>% count(overlap.pval_level) %>% paged_table()
Despite their wide range of similarity scores, we found that 955 out of 990 network pairs (96.5%) are significantly more similar than random expectation.
# stable plot results
p_scatter = ggplot(network_sim_df, aes(x = overlapindex,
y = log2(overlapindex/overlap.mean))) +
geom_point(aes(col = overlap.pval_level)) +
scale_colour_viridis_d(direction = -1) +
theme_minimal() +
xlab(expression(Similarity~(S[AB]))) + labs(title = "Network pair similarity", col = "significance") +
ylab(expression(log[2](S[AB]/mu[S[AB]])))
# plot by type
p_scatter_by_type <- p_scatter +
facet_grid(. ~ type_pair) +
geom_text_repel(aes(label = pair_label)) +
theme_cowplot() +
theme(legend.position = "bottom")
# p value plot by level
pval_lv_count_df <- network_sim_df %>% count(overlap.pval_level, name = "count")
pval_lv_count_by_type_df <- network_sim_df %>%
count(overlap.pval_level, type_pair, name = "count") %>%
group_by(type_pair) %>%
mutate(prop = count/sum(count))
p_count_by_type = ggplot(pval_lv_count_by_type_df, aes(x = overlap.pval_level, y = prop)) +
geom_col(aes(fill = overlap.pval_level)) +
scale_fill_viridis_d(direction = -1) +
xlab("Significance") + ylab("proportion") + guides(fill = F) +
theme_minimal() + facet_grid(. ~ type_pair) + labs(title = "Network similarity significance level") +
theme_cowplot()
p = p_scatter_by_type/p_count_by_type
#ggsave("../Figs/scatter_randomisation_network_similarity.pdf", p, width = 7, height = 6)
p
Interestingly, we also observed that networks on different scales (i.e. among non co-expression layers) are all significantly similar, showing that there are key edges being maintained across genotype to phenotype.
We characterised our tissue-specific co-expression networks based on GTEx. Our hypothesis is that genes that are highly co-expression across all tissues are likely required for cellular developments and survival, and should show a strong correlation with of essentiality. In this analysis, we downloaded the list of human essential genes from the OGEE database (v2), also included in /data/OGEE_esential_genes_20190416.txt.
## coexpresssion - share edges
## goal: to observe whether the shared edges among co-expression networks are essential
# 0 - load required data
## load coexel sum
library(pacman)
p_load(tidyverse, cowplot, knitr)
coex_el_sum_grouped = readRDS("../cache/coexpression_edge_counts_by_group.RDS")
# create a vector of total probability for each class
coex_el_sum_score = coex_el_sum_grouped %>%
ungroup() %>%
group_by(essential_edge_score) %>%
summarise(count = sum(n)) %>% pull(count, name = essential_edge_score)
coex_el_sum_grouped %>%
dplyr::rename(n_tissues = n_binned_relabel) %>%
group_by(n_tissues) %>%
summarise(n_edges = sum(n), percent = sum(n)*100/sum(coex_el_sum_score)) %>%
kable
| n_tissues | n_edges | percent |
|---|---|---|
| 0-5 | 12056143 | 91.8978690 |
| 5-10 | 747278 | 5.6961215 |
| 10-15 | 209979 | 1.6005635 |
| 15-20 | 70745 | 0.5392533 |
| 20-25 | 23372 | 0.1781529 |
| 25-30 | 7347 | 0.0560025 |
| 30-38 | 4203 | 0.0320373 |
The plot for higher essentiality aas number of genes increased is shown below.
bar_essential = ggplot(coex_el_sum_grouped) +
geom_bar(aes(x = n_binned_relabel, y = n, fill = score), stat = "identity", position="fill") +
xlab("Number of tissues") + ylab("Edge proportion") + theme_cowplot() +
theme(legend.position = "bottom", axis.text.x = element_text(size = 10)) +
guides(fill = guide_legend(title = "Essential gene in edge")) + scale_fill_manual(values = c('grey60','#9ecae1','#3182bd'))
bar_essential
#ggsave("./Figs/essentiality_co-expression.pdf", bar_essential, width = 4, height = 4)
coex_el_sum = readRDS("../cache/coexpression_raw_edge_counts.RDS") %>% ungroup
coex_el_sum_by_tissue <- coex_el_sum %>%
count(n, essential_edge_score, name = "count")
The structure of Orphanet Rare Disease Ontology was queried and processed using R interface of the Ontology Lookup Service (https://lgatto.github.io/rols/index.html). A number of calculation per-computed for further analyses on this section was performed in source/Orphanet_annotate_genes_to_ancestors.R.
# load direct gene association
orpha_gene_onset_df <- readRDS("../cache/orpha_gene_onset_df.RDS")
# disease gene association with roots
orphanet_gene_association <- read_tsv("../data/orphaNet_disease_gene_association_with_roots.tsv")
# disease gene association at group level
source("../functions/readdata_functions.R")
gene_disease_orpha = process_disease_genes_data("../data/table_disease_gene_assoc_orphanet_genetic.tsv", 1, 2000)
[1] "read 28 diseases, of total 3593 associated genes."
#source("../source/read_orphanet_gene_association_data.R")
gene_disease_orpha = gene_disease_orpha$disgene_df
# Modify and merge data
orpha_gene_display_df <- orphanet_gene_association %>%
dplyr::filter(n_genes > 0) %>%
mutate(ID = as.double(str_remove(orphaID, "Orphanet:")))
DT::datatable(orpha_gene_display_df[,c("ID", "label", "n_genes", "genes")] ,
extensions = 'Buttons',
options = list(dom = 'Blfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
lengthMenu = list(c(10,25,50,-1),
c(10,25,50,"All"))))
Rare diseases are scarcely annotated, and most disease terms (2686 out of 3771) are only associated with one gene. Network-based measurements for individual diseases are unfeasible and grouping of the terms for higher level association are necessary.
# from orphanet_mapping_top_branch
gene_per_disease = orpha_gene_onset_df %>%
dplyr::filter(!is.na(gene)) %>%
count(orphaID)
gene_per_disease_count = gene_per_disease %>%
mutate(group = cut(n, breaks = c(0:10, 100), labels = c(1:10, "> 10"))) %>%
count(group)
p = ggplot(gene_per_disease_count, aes(x = group, y =n)) + geom_col() +
ggtitle("Most orphanet diseases are immediately associated with one gene") +
theme_minimal() + ylab("Number of diseases") + xlab("Number of genes per disease")#+ scale_y_log10()
plotly::ggplotly(p)
pacman::p_load(cowplot)
#gene_per_disease_group = gene_disease_orpha %>%
# mutate(group = cut(n_genes, breaks = c(seq(0,100,20), seq(200,1000, 200), Inf))) %>% count(group)
#ggplot(gene_per_disease_group, aes(x = group, y =n)) + geom_col() +
# ggtitle("Most orphanet diseases are immediately associated with one gene") +
# theme_minimal() + ylab("Number of diseases") + xlab("Number of genes per disease")#+ scale_y_log10()
# number shift
gene_per_disease_group = gene_disease_orpha %>%
mutate(disease = "Grouped", n = N) %>%
select(disease, n)
gene_per_disease = gene_per_disease %>%
mutate(disease = "Individual")
gene_per_disease_both = gene_per_disease %>%
select(disease, n) %>%
bind_rows(., gene_per_disease_group) %>%
dplyr::filter(n>0) %>%
# relevel factor
mutate(disease = factor(disease, levels = c("Individual","Grouped")))
# add break values for plotting on log scale on x axis
breakvals = c(0:9, seq(10,90,10), seq(100,1000,100), 2000)
gene_per_disease_both_count = gene_per_disease_both %>%
mutate(group = cut(n, breaks = breakvals, include.lowest = T, labels = breakvals[-1])) %>%
group_by(disease, group) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n)) %>%
full_join(., tibble(group = as_factor(breakvals[-1])))
gene_per_disease_both_count$disease[is.na(gene_per_disease_both_count$disease)] = "Grouped"
gene_per_disease_both_count$n[is.na(gene_per_disease_both_count$n)] = 0
# plot
#ggplot(gene_per_disease_both_count, aes(y = prop, x = group)) + geom_col() + theme_minimal() +facet_grid(disease ~ .)
p = ggplot(gene_per_disease_both_count, aes(y = n, x = group)) +
geom_col() +
facet_grid(disease ~ ., scales = "free") +
#scale_x_discrete(guide = guide_axis(check.overlap = TRUE)) +
scale_x_discrete(breaks = c(1,10,100,1000))+
geom_vline(xintercept = which(levels(gene_per_disease_both_count$group)==20), linetype = "dashed", col = "red") + theme_cowplot() + xlab("Genes per disease term") + ylab("Number of terms")
p
#ggsave("../Figs/orphanet_individual_vs_grouped_diseases_bar.pdf", p, width = 3*1.5, height = 2*1.5)
Based on the plot above, accumulating gene-disease association for descendant disease terms of ‘Rare genetic disease’ ( “Orphanet:98053”) resulting in physiologically distinct disease groups where the majority (26 out of 28) groups are associated with sufficient amount of genes for module detection (\(n=20\)).
The disease gene association can be found in data/table_disease_gene_assoc_orphanet_genetic.tsv. The summary of disease groups and all associated genes are shown below.
gene_disease_orpha_df <- gene_disease_orpha %>% select(name, N) %>% arrange(-N)
rmarkdown::paged_table(gene_disease_orpha_df)
Despite the wide range of associated genes, the average number of genes per disease term remains comparable across all disease groups, ensuring similar level of disease specificity across the disease domain.
root_and_gene_df <- orpha_gene_display_df %>%
select(ID, genes, roots) %>%
separate_rows(roots, sep = ";")
n_term_per_group <- root_and_gene_df %>%
distinct(ID, roots) %>%
count(roots, name = "n_terms")
gene_disease_orpha_df$pass <- gene_disease_orpha_df$N >= 20
disease_characteristics <- left_join(gene_disease_orpha_df, n_term_per_group, by = c("name"="roots")) %>%
mutate(genes_per_term = N/n_terms) %>%
pivot_longer(!c(name, pass), names_to = "property", values_to = "count") %>%
mutate(name = factor(name, levels = rev(gene_disease_orpha_df$name)),
property = factor(property, levels = c("N", "n_terms", "genes_per_term"),
labels = c("# genes", "# disease terms", "# genes/term")))
p <- ggplot(disease_characteristics, aes(x = name, y = count, fill = pass)) +
geom_col() +
facet_grid(.~property, scales = "free") +
theme_cowplot() +
coord_flip() +
scale_fill_manual(values = c("#bdbdbd", "#1c9099")) +
guides(fill = FALSE)
#ggsave("../Figs/barplot_disease_characteristics.pdf",p, width = 10, height = 5)
p
The summary statistics for each of these properties are as follows:
# A tibble: 3 x 2
property mean
* <fct> <dbl>
1 # genes 339.
2 # disease terms 262.
3 # genes/term 1.74
Even though some disease groups may contain some overlap terms, 90.5% of disease pairs are distinctive (Jaccard Index < 0.1) and therefore represent unique disease definition
removed_roots <- gene_disease_orpha %>% filter(N<20) %>% pull(name)
terms_per_root <- root_and_gene_df %>%
filter(!roots %in% removed_roots) %>%
group_by(roots) %>%
summarise(IDs = list(unique(ID))) %>%
filter(!is.na(roots)) %>%
pull(IDs, name = roots)
genes_per_root <- gene_disease_orpha %>%
pull(genes_all, name = name)
terms_pairwise_df <- combn(names(terms_per_root), 2) %>% t %>% as.tibble()
jaccard <- function(x1, x2){
length(intersect(x1, x2))/length(union(x1, x2))}
overlap <- function(x1, x2){
length(intersect(x1, x2))/min(length(x1), length(x2))}
terms_pairwise_df$jaccard_terms <- apply(terms_pairwise_df, 1, function(x) jaccard(terms_per_root[[x[1]]], terms_per_root[[x[2]]]))
terms_pairwise_df$jaccard_genes <- apply(terms_pairwise_df, 1, function(x) jaccard(genes_per_root[[x[1]]], genes_per_root[[x[2]]]))
terms_pairwise_df$overlap_terms <- apply(terms_pairwise_df, 1, function(x) overlap(terms_per_root[[x[1]]], terms_per_root[[x[2]]]))
terms_pairwise_df$overlap_genes <- apply(terms_pairwise_df, 1, function(x) overlap(genes_per_root[[x[1]]], genes_per_root[[x[2]]]))
terms_pairwise_df <- terms_pairwise_df %>%
mutate(label = ifelse(jaccard_terms>0.2, str_remove_all(paste(V1, V2,sep = " & "), "Rare |genetic | disorder| diseases| disease| during embryogenesis|inborn errors of |and obstetrical"),""))
p = ggplot(terms_pairwise_df, aes(x=jaccard_terms, y=jaccard_genes, label = label)) +
geom_point(aes(col = jaccard_terms), alpha = 0.7) +
scale_color_viridis_c() +
theme_cowplot() +
xlab("Jaccard (disease terms)") +
ylab("Jaccard (disease genes)") +
ggrepel::geom_text_repel() + guides(col = F)
#ggsave("../Figs/scatter_disease_similarity_jaccard.pdf", p, width = 5, height = 5)
p
Omitting disease groups with fewer than 20 associated genes, the number of terms (out of 3771) and genes excluded from further analyses include:
# table for removed terms
removed_disease_terms <- orphanet_gene_association %>%
mutate(n_roots = str_count(roots, ";")+1,
removed_roots = grepl(removed_roots[1], roots) + grepl(removed_roots[2], roots),
remained_roots = n_roots - removed_roots) %>%
filter(remained_roots == 0) %>%
select(orphaID, label, genes, roots)
removed_genes <- sapply(removed_disease_terms$genes, function(x) str_split(x, ";")) %>% unlist %>% unique()
knitr::kable(removed_disease_terms)
| orphaID | label | genes | roots |
|---|---|---|---|
| Orphanet:199241 | Pulmonary capillary hemangiomatosis | EIF2AK4 | Rare genetic respiratory disease |
| Orphanet:210122 | Congenital alveolar capillary dysplasia | FOXF1 | Rare genetic respiratory disease |
| Orphanet:217566 | Chronic respiratory distress with surfactant metabolism deficiency | SFTPC | Rare genetic respiratory disease |
| Orphanet:217563 | Neonatal acute respiratory distress due to SP-B deficiency | SFTPB | Rare genetic respiratory disease |
| Orphanet:440402 | Interstitial lung disease due to ABCA3 deficiency | ABCA3 | Rare genetic respiratory disease |
| Orphanet:440392 | Interstitial lung disease due to SP-C deficiency | SFTPC | Rare genetic respiratory disease |
| Orphanet:444092 | Autoimmune interstitial lung disease-arthritis syndrome | COPA | Rare genetic respiratory disease |
| Orphanet:31837 | Pulmonary venoocclusive disease | BMPR2;EIF2AK4 | Rare genetic respiratory disease |
| Orphanet:100051 | Hereditary angioedema type 2 | SERPING1 | Serpinopathy |
| Orphanet:100050 | Hereditary angioedema type 1 | SERPING1 | Serpinopathy |
There are 10 disease terms whose associated genes are not associated with any other disease groups, and hence these 8 genes are omitted.
# download the association
library(RColorBrewer)
orphanet_gene_association_unique_root <- orphanet_gene_association %>%
separate_rows(., roots, sep = ";", convert = T) %>% dplyr::filter(!is.na(roots)) %>%
mutate(roots = as.factor(roots))
## take a function to allow modifying alpha values
# Add an alpha value to a colour
add.alpha <- function(col=NULL, alpha=1){
if(missing(col))
stop("Please provide a vector of colours.")
apply(sapply(col, col2rgb)/255, 2,
function(x)
rgb(x[1], x[2], x[3], alpha=alpha))
}
# define colours for all disease groups
mycolors <- colorRampPalette(brewer.pal(8, "Set2"))(nrow(gene_disease_orpha))
# add id 1-28
gene_disease_orpha$id = 1:nrow(gene_disease_orpha)
# add colour with corresponding alpha values to each disease group
gene_disease_orpha_mod <- gene_disease_orpha %>%
rowwise() %>%
mutate(col =mycolors[id])
#mutate(col = add.alpha(mycolors[id], N/max(gene_disease_orpha$N)))
gene_disease_orpha_mod$root = "Orphanet"
# load voronoi treemap package
if(!"voronoiTreemap" %in% rownames(installed.packages())){
pacman::p_load_gh("https://github.com/uRosConf/voronoiTreemap")
} else{
library(voronoiTreemap)
}
gene_disease_orpha_mod$plotlab = str_remove_all( gene_disease_orpha_mod$name, "Rare |genetic | disease| syndrome| disorder")
onto_json <- vt_export_json(vt_input_from_df(gene_disease_orpha_mod, hierachyVar0 = "root", hierachyVar1 = "name", hierachyVar2 = "name", colorVar = "col", weightVar = "N", labelVar = "plotlab"))
vt_d3(onto_json)
The node2vec embedding was performed to allow visualization of large networks in small space.
# only run this chuck to recompute all the coordinate values
#embed_result_dir = "./embedded_results/"
embed_result_dir = "../data/network_node2vec_results//" #only the new coexpression networks
embed_files = list.files(embed_result_dir, recursive = T)
# dim reduction
library(uwot)
library(Rtsne)
for(i in embed_files){
print(i)
df = read_delim(paste0(embed_result_dir, i), delim = " ", skip = 1, col_names = F)
df = df[apply(df, 1, function(x) !any(is.na(x))),]
df = column_to_rownames(df, var = "X1")
print("UMAP")
umap_results = uwot::umap(df, n_neighbors = 15)
print("PCA")
pca_results = pcaMethods::pca(df)
print("tsne")
tsne_results = Rtsne::Rtsne(X = df, dims=2)
umap_results_df = tibble(X = umap_results[,1], Y = umap_results[,2])
tsne_results_df = tibble(X = tsne_results$Y[,1], Y = tsne_results$Y[,2])
pca_results_df = tibble(X = pca_results@scores[,1], Y = pca_results@scores[,2])
umap_results_df$name = tsne_results_df$name = pca_results_df$name = rownames(df)
write_tsv(umap_results_df, paste("../cache/embedded_results_2D/", i, "umap.tsv", sep = "_"), col_names = T)
write_tsv(tsne_results_df, paste("../cache/embedded_results_2D/", i, "tsne.tsv", sep = "_"), col_names = T)
write_tsv(pca_results_df, paste("../cache/embedded_results_2D/", i, "pca.tsv", sep = "_"), col_names = T)
}
for(d in alldiseases){
disease = orpha$disgene_list[[d]]
for(n in embed_files){
network_name = str_replace(n, "coex/", "")
# load the node2vec embedded results
tsne_results_df = read_tsv(paste("../cache//embedded_results_2D/", network_name, "tsne.tsv", sep = "_"), col_types = 'ddc') %>%
mutate(indisease = name %in% disease)
n <- str_replace(n, "/", "_") # replace / by _ for labelling
# create directory to store results
dir.create(file.path(paste0(embedding_fig_dir, "/by_disease"), d), showWarnings = FALSE)
dir.create(file.path(paste0(embedding_fig_dir, "/by_network"), n), showWarnings = FALSE)
p <- ggplot(tsne_results_df) +
stat_density_2d(aes(X,Y, fill = stat(level)), geom = "polygon") +
theme(panel.background = element_rect(fill = '#0f2030'),
axis.line=element_blank(),axis.text.x=element_blank(),
axis.text.y=element_blank(),axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank(),legend.position="none",
panel.border=element_blank(),panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),plot.background=element_blank())+
geom_point(aes(X,Y, alpha = indisease), col = "white", size = 1) +
scale_alpha_discrete(range = c(0, 0.5)) + guides(fill = FALSE, alpha = FALSE, col = FALSE)+
ggtitle(paste0(d,": ",n))
ggsave(filename = sprintf("%s/%s/%s/%s.pdf", embedding_fig_dir, "by_disease", d, n), plot = p, device = "pdf", width = 5, height = 5)
ggsave(filename = sprintf("%s/%s/%s/%s.pdf", embedding_fig_dir, "by_network", n, d), plot = p, device = "pdf", width = 5, height = 5)
}
}
# only run this chunk to replot
source("../functions/readdata_functions.R")
# folder to add figures
embedding_fig_dir = "../Figs/embedding/"
orpha = process_disease_genes_data("../data/table_disease_gene_assoc_orphanet_genetic.tsv", 20, 2000)
[1] "read 26 diseases, of total 3586 associated genes."
alldiseases = names(orpha$disgene_list)
tsne_results_df <- read_tsv("../cache//embedded_results_2D/_HP_tsne.tsv", col_types = 'ddc')
selected_diseases <- c(
"Rare genetic immune disease",
"Rare genetic cardiac disease",
"Rare genetic renal disease",
"Rare genetic bone disease",
"Rare genetic hematologic disease",
"Rare genetic neurological disorder"
)
tsne_results_df <- read_tsv("../cache//embedded_results_2D/_HP_tsne.tsv", col_types = 'ddc')
selected_diseases <- c(
"Rare genetic immune disease",
"Rare genetic cardiac disease",
"Rare genetic renal disease",
"Rare genetic bone disease",
"Rare genetic hematologic disease",
"Rare genetic neurological disorder"
)
embed_files_selected <- "HP"
embedding_fig_dir = "../Figs/embedding/"
embedplot <- list()
for(d in selected_diseases){
disease = orpha$disgene_list[[d]]
# create directory to store results
dir.create(file.path(embedding_fig_dir, d), showWarnings = FALSE)
embedplot[[d]] <- list()
for(n in embed_files_selected){
network_name = str_replace(n, "coex/", "")
# load the node2vec embedded results
tsne_results_df = read_tsv(paste("../cache//embedded_results_2D/", network_name, "tsne.tsv", sep = "_"), col_types = 'ddc') %>%
mutate(indisease = name %in% disease)
n <- str_replace(n, "/", "_") # replace / by _ for labelling
embedplot[[d]][[n]] <- ggplot(tsne_results_df) +
stat_density_2d(aes(X,Y, fill = stat(level)), geom = "polygon") +
theme(panel.background = element_rect(fill = '#0f2030'),
axis.line=element_blank(),axis.text.x=element_blank(),
axis.text.y=element_blank(),axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank(),legend.position="none",
panel.border=element_blank(),panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),plot.background=element_blank())+
geom_point(aes(X,Y, alpha = indisease), col = "white", size = 1) +
scale_alpha_discrete(range = c(0, 0.5)) + guides(fill = FALSE, alpha = FALSE, col = FALSE)+
ggtitle(paste0(d,": ",n))
}
}
embedplot <- lapply(embedplot, function(x) x[[1]])
(embedplot[[1]] + embedplot[[2]] + embedplot[[3]])/(embedplot[[4]] + embedplot[[5]] + embedplot[[6]])
The computation was performed on bash script ./source/compute_localisation_analysis_allnetworks_orphanet_rare.sh and the saved results are shown below.
# library and configurations
pacman::p_load(tidyverse, ggrepel, knitr, cowplot)
pacman::p_load_gh("nightingalehealth/ggforestplot")
# for printing (cm)
A4width = 21
A4height = 29.7
# load network labels
network_info = read_tsv("../data/network_details.tsv")
source("../functions/process_LCC_result.R")
# defines colours used in the heatmap
cols = RColorBrewer::brewer.pal(5, "Blues")
cols = cols[2:5]
## a function for heatmap plotting
LCC_heatmap_plot = function(result_folder, heatmap_file){
#' @input result_folder: folder where LCC files were saved
#' @input heatmap_file: file where the heatmap is to be saved
## process the results LCC significance
result_df = readRDS(paste0(result_folder, "LCC_and_distance_calculation_results.RDS"))
processed_result_df = process_LCC_result(result_df)
processed_result_df <- processed_result_df %>% mutate(LCC.signif = factor(LCC.signif, levels = rev(levels(LCC.signif))))
p <- processed_result_df %>% dplyr::filter(LCC.signif != "none") %>%
ggplot(aes(name, subtype)) + geom_tile(fill = "white") + facet_grid(.~main_type, space = "free", scale = "free") +## to get the rect filled
# geom_point(aes(size = N_in_graph*1.7), colour = LCC.signif) + ## geom_point for circle illusion
geom_stripes(odd = "grey90", even = "#00000000") +
# theme_light() +
# theme_forest() +
theme_minimal_hgrid() +
theme(panel.spacing = unit(0.25, "lines"),
axis.text.x = element_text(angle = 90, hjust = 1),
panel.grid.major.y = element_line(colour="grey", linetype="dotted")) +
geom_point(aes(
size = log10(N_in_graph),
fill = LCC.signif,
colour = LCC.signif), alpha = 0.7) +
scale_color_manual(values = cols) +
# scale_size(range = c(1, 10))+ ## to tune the size of circles
coord_flip() +
labs(x="", y="") + guides(size = F)
ggsave(heatmap_file, plot = p, width = 1.5*0.5*A4width, height = 1.5*0.175*A4height)
}
rare_genetic_result_folder = "../cache/output/Orphageneset_rare/"
rare_genetic_heatmap_file = "../cache/heatmap_network_disease_association_orphanets_genetic_rare_diseases.pdf"
if(!file.exists(rare_genetic_heatmap_file)){
LCC_heatmap_plot(rare_genetic_result_folder, rare_genetic_heatmap_file)
}
knitr::include_graphics(rare_genetic_heatmap_file)
# number of significant diseases
library(scales)
result_df = readRDS(paste0("../cache/output/Orphageneset_rare/", "LCC_and_distance_calculation_results.RDS")) %>%
process_LCC_result(.) %>%
mutate(LCC.signif = factor(LCC.signif, levels = rev(levels(LCC.signif))))
ntissue_per_disease = result_df %>%
dplyr::filter(LCC.signif!="none") %>%
count(name, LCC.signif) #%>% mutate(name = factor(name, levels = levels_name))
p = ggplot(ntissue_per_disease, aes(x = name, y = n, fill = LCC.signif)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = cols) +
theme_minimal_vgrid() +
ylab("number of tissues") +
coord_flip() +
scale_y_continuous(breaks= pretty_breaks())
P_without_label <- p + theme(axis.text.y = element_blank())
#ggsave("../Figs/barplot_ntissue_per_disease_for_heatmap_network_disease_association_orphanets_genetic.pdf", width = 0.25*1.1*0.5*A4width, height = 1.1*0.175*A4height)
p
# number of significant diseases
ndisease_per_tissue = result_df %>%
dplyr::filter(LCC.signif!="none") %>%
count(network, LCC.signif, main_type, subtype) %>%
mutate(network = fct_reorder(network, n, sum, .desc = T),
subtype = fct_reorder(subtype, n, sum, .desc = T))
p = ggplot(ndisease_per_tissue, aes(x = subtype, y = n, fill = LCC.signif)) + geom_bar(stat = "identity") + scale_fill_manual(values = cols)+ theme_minimal_hgrid() + ylab("number of disease groups") + scale_y_continuous(breaks= pretty_breaks()) + facet_grid(.~main_type, space = "free", scale = "free")
P_without_label <- p + theme(axis.text.x = element_blank())
#ggsave("../Figs/barplot_ndisease_per_tissue_for_heatmap_network_disease_association_orphanets_genetic.pdf", width = 1.1*0.5*A4width, height = 0.35*1.1*0.175*A4height)
p + theme(axis.text.x = element_text(angle = 90, hjust = 1))
# is the number of significant tissues influenced by the number of genes in the disease set?
setwd("~/Documents/projects/Multiome/")
source("./functions/readdata_functions.R")
Orphanet_dat = process_disease_genes_data("./Disease_gene_assoc/Orphanet/output/table_disease_gene_assoc_orphanet.tsv", min_gene = 20, max_gene = 2000)
Orphanet_df = Orphanet_dat$disgene_df
N_signif_net = result_df %>% group_by(name) %>% dplyr::filter(LCC.signif) %>% count() %>% rename(significant_tissues = n)
Orphanet_df = left_join(Orphanet_df, N_signif_net)
library(ggrepel)
p = ggplot(Orphanet_df, aes(x = N, y = significant_tissues, label = ifelse(N > 700, name, ""))) + geom_point() + scale_x_log10()+ ggtitle("There is no strong relationship", subtitle = "between gene set size and number of significant tissues") + theme_cowplot() + xlab("Number of associated genes") + ylab("Number of significant tissues") + geom_text_repel()
ggsave("./Figs/scatter_SignificantTissues_Ngenes.pdf", p, width = 0.25*A4width, height = 0.125*A4height)
p = ggplot(result_df, aes(x = N_in_graph, y = LCC.zscore)) + geom_point() + ggtitle("There is no strong relationship", subtitle = "between gene set size and number of significant tissues") + theme_cowplot() + xlab("Number of genes in the network") + ylab("LCC z-score") + theme(legend.position = "bottom") + facet_wrap(.~name, ncol =3, scales = "free")
ggsave("./Figs/scatter_SignificantTissues_Ngenes_per_disease.pdf", p, width = 0.5*A4width, height = 0.5*A4height)
A case study of microscopic analysis using our multiple networks framework. We explore the differential modularity of gastroenterological diseases as an example.
# load cached networks and diseases and LCC results
source("../source/cache_all_networks_and_disease_genes.R")
pacman::p_load(treemap, igraph, tidygraph, ggraph)
disease_of_interest <- "Rare genetic gastroenterological disease"
# unique disease set under the term
individual_gene_set <- individual_diseases_genes %>% dplyr::filter(grepl(disease_of_interest, roots))
# all genes associated
genes_set <- rare_genetic_diseases_genes$disgene_list[[disease_of_interest]]
# list of all significant networks
signif_nets <- processed_result_df %>% dplyr::filter(name == disease_of_interest, LCC.signif != "none") %>% arrange(desc(LCC.zscore)) %>% pull(network)
# degrees each gene in each network
degree_diseasegenes <- sapply(g[signif_nets], function(x) degree(x, v = intersect(V(x)$name, genes_set)))
# igraph object of significant networks for the disease group
disease_graph <- lapply(signif_nets, function(x) g[[x]]- setdiff(V(g[[x]])$name, genes_set))
names(disease_graph) <- signif_nets
There are rnrow(individual_gene_set)` individual disease terms under Rare genetic gastroenterological disease, involving 140 genes.
# print the paged table
individual_gene_set %>%
select(label, genes, n_genes) %>%
arrange(-n_genes) %>%
rmarkdown::paged_table()
The tree map for descendants of Rare genetic gastroenterological disease are as follows (limited to terms with at least two associated genes).
# set a threshold for minimal amount of genes required to include in the plotting
threshold = 2
individual_gene_set_wrap <- individual_gene_set %>% dplyr::filter(n_genes >= threshold) %>% select(label, n_genes)
n_removed <- sum(individual_gene_set$n_genes<=threshold)
#individual_gene_set_wrap <- rbind(individual_gene_set_wrap, data.frame(label = sprintf("Other (%i diseases with one or two associated genes)", n_removed), n_genes = n_removed))
# Create data
{group <- ifelse(individual_gene_set_wrap$n_genes > threshold+2,
individual_gene_set_wrap$label, str_trunc(individual_gene_set_wrap$label, 50, side = "right"))
value <- individual_gene_set_wrap$n_genes
data <- data.frame(group,value)
treemap(data,
palette = "Set3",
title = disease_of_interest,
index="group",
vSize="value",
type="index"
)
}
# treemap - uncomment to plot
#pdf("../Figs/treemap_gastroenterological.pdf", width = 5, height = 4)
The module significance computation shows the following networks being significant:
| subtype | main_type | N_in_graph | LCC.size | LCC.mean | LCC.sd | LCC.zscore | correctedPval |
|---|---|---|---|---|---|---|---|
| Protein-protein interaction | Other | 139 | 77 | 12.844 | 8.449086 | 7.593248 | 0.0000000 |
| Human Phenotype | Other | 120 | 102 | 39.789 | 16.155292 | 3.850812 | 0.0004920 |
| Esophagus mucosa | Co-expression | 99 | 50 | 19.179 | 9.010002 | 3.420754 | 0.0021502 |
| Co-Pathway Membership | Other | 44 | 11 | 4.622 | 2.693022 | 2.368343 | 0.0363355 |
| Co-essentiality | Other | 30 | 7 | 3.465 | 1.570108 | 2.251437 | 0.0470091 |
The network differential modularity can be visualised as follows:
# convert the igraph object into df for ggraph visualisation
disease_graph_df <- lapply(disease_graph, as_data_frame) %>%
bind_rows(., .id = "network") %>%
mutate(network = factor(network, levels = signif_nets))
disease_tidygraph <- as_tbl_graph(disease_graph_df) %>%
mutate(degree = centrality_degree(mode = 'all'))
tidygraph_ggrpah <- ggraph(disease_tidygraph, layout = 'kk')
p <- tidygraph_ggrpah +
geom_edge_density(aes(fill = network),
show.legend = FALSE) +
geom_edge_fan(aes(alpha = stat(index),
col = network),
show.legend = FALSE) +
geom_node_point(col = "white",
alpha = 0.25,
show.legend = F,
size = 1) +
facet_edges(~factor(network, levels = signif_nets)) +
# black theme
theme_graph(fg_text_colour = 'white',
background = "black",
text_colour = "white")
# uncomment when need to replot and save
#ggsave2(sprintf("../Figs/networks_%s.pdf", disease_of_interest), plot = p, width = 10, height = 6)
#ggsave2(sprintf("../Figs/networks_%s_separate_dimmed.png", disease_of_interest), plot = p, width = 10, height = 6)
p
pacman::p_load(patchwork, cowplot, ggstatsplot)
knitr::opts_chunk$set(echo=FALSE)
AUC_folds_file <- "../cache/fold_cv_processed_results.RDS"
if(!file.exists(AUC_folds_file)){
source("../source/process_fold_cv_results.R")
}
palettes <- c('#fc8d62','#66c2a5','#8da0cb')
AUC_folds_summary <- readRDS(AUC_folds_file)
p <- ggplot(AUC_folds_summary,aes(x=name_short, fill = label, group = label)) +
geom_ribbon(aes(ymin = Q25, ymax = Q75), alpha = 0.25) +
geom_line(aes(y = med, col = label), linetype = 2) +
coord_flip() + xlab("disease") + ylab("fold AUC") + theme_minimal_hgrid() + theme(legend.position = "bottom") +
ggtitle("ROC-AUC comparison", subtitle = "area spans 25, 50 and 75% quantile") +
scale_fill_manual(values=palettes) + scale_color_manual(values=palettes)
#ggsave("../Figs/cvROC/AUC_folds_comparison_short_names.pdf", p, width = 6, height = 6)
p
AUC_folds_summary_plot <- tibble(label = AUC_folds_summary$label,
AUC = AUC_folds_summary$mean)
plot_df = tibble(x = factor(AUC_folds_summary$label,
levels = c( "PPI only" , "all networks" , "significant networks"),
labels = c( "PPI only" , "All networks" , "Significant networks")), y = AUC_folds_summary$mean)
barstatplot <- ggstatsplot::ggwithinstats(data = plot_df,
x = x, y = y,type = "np") +
xlab("Method") + ylab("AUC") +
scale_color_manual(values = palettes) #+ylim(0.6, 1)
#ggsave(plot = barstatplot, filename = "../Figs/10fold_cv_method_compared_boxplot.pdf", width = 6*0.8, height = 4*0.8)
barstatplot
## load the LCC results
result_folder<-"../cache/output/Orphageneset_rare/"
result_df<-readRDS(paste0(result_folder, "LCC_and_distance_calculation_results.RDS"))
source("../functions/process_LCC_result.R")
result_df<-process_LCC_result(result_df, network_annotate=F)
## count number of networks, only coex
n_signif_df<-result_df %>% dplyr::filter(LCC.signif != "none", grepl("coex", network)) %>% group_by(name) %>% count(name)
## load the gene sets
source("../functions/readdata_functions.R")
Orphanet_df<-process_disease_genes_data("../data/table_disease_gene_assoc_orphanet_genetic.tsv", min_gene=20, max_gene=2000)
[1] "read 26 diseases, of total 3586 associated genes."
Orphanet_n_gene_df<-Orphanet_df$disgene_df
# retrieve median values
AUC_folds_perf_compare <- AUC_folds_summary %>%
pivot_wider(., id_cols = name, names_from = label, values_from = med) %>%
mutate(dif = `significant networks` - `all networks`) %>%
left_join(x=., y = n_signif_df) %>%
left_join(x=., y = Orphanet_n_gene_df[,c("name","N")])
AUC_allnets_median <- AUC_folds_summary %>%
dplyr::select(-Q25, -Q75, -name_short) %>%
dplyr::filter(label == 'significant networks') %>%
left_join(x=., y = n_signif_df) %>%
left_join(x=., y = Orphanet_n_gene_df[,c("name","N")])
p1 <- ggplot(AUC_folds_perf_compare %>% mutate(label = ifelse(dif< 0, name,"")), aes(x = n, y = dif)) +
geom_point(col = "grey40") +
theme_cowplot() + guides(size = F) + geom_hline(yintercept = 0, linetype = "dotted", col = "grey20") +
# stat_summary(fun.data= mean_cl_normal) +
geom_smooth(method='lm', se = F) +
xlab("# Significant tissues") + ylab(expression("Performance diff. ("*Delta*"AUC)"))
p2 <- ggplot(AUC_allnets_median, aes(x = N, y = med)) + geom_point(col = "grey40") + theme_cowplot() +
guides(size = F) + xlab("# Associated genes") + ylab("Median AUC") +
scale_x_log10() +
geom_smooth(method='lm', se = F)
p <- p1 + p2
#ggsave(filename = "../Figs/scatter_AUC_vs_ntissues.pdf",plot = p, height = 3, width = 6)
p
cor.test(AUC_folds_perf_compare$dif,AUC_folds_perf_compare$n, method = "spearman")
Spearman's rank correlation rho
data: AUC_folds_perf_compare$dif and AUC_folds_perf_compare$n
S = 4617.8, p-value = 0.001952
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.5787341
cor.test(AUC_allnets_median$N, AUC_allnets_median$med, method = "spearman")
Spearman's rank correlation rho
data: AUC_allnets_median$N and AUC_allnets_median$med
S = 5380, p-value = 1.923e-06
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.8393162
p <- AUC_folds_perf_compare %>%
left_join(., distinct(AUC_folds_summary, name, name_short), by="name") %>%
ungroup %>%
mutate(name_short = fct_reorder(name_short, n)) %>%
ggplot(., aes(x = name_short, y = n, fill=dif)) + geom_col() + theme_minimal() + coord_flip() + scale_fill_gradient2() + ylab("# Significant tissues") + xlab("") + labs(fill = "")+ ggtitle("Performance difference", subtitle = "Selected networks vs all networks")
#ggsave("../Figs/barchart_n_tissues_vs_performance_difference.pdf", height = 4, width = 6)
p
# A tibble: 8 x 2
Patient `number of genes`
* <fct> <int>
1 P1 47
2 P2 33
3 P3 9
4 P4 4
5 P5 6
6 P6 13
7 P7 10
8 P8 7
Average number of genes with high confident variants detected in each patient:
These patient are diagnosed, i.e. has a confirmed causal variant. We wonder whether the causal gene can be picked up by the algorithm. The patient phenotype belong to rare genetic neurological disorder.
patient_hpo_df <- read_csv("../data/patient_hpo_terms.csv",
col_names = c('Patient', "hpo_id", "hpo_label"), skip = 1) %>%
mutate(hpo_id = str_trim(hpo_id, side = 'both')) %>%
mutate(patient_ID = factor(Patient, levels = unique(Patient), label = paste0("P", 1:length(unique(Patient)))))
hpo_counts <- patient_hpo_df %>%
count(hpo_label) %>%
arrange(n) %>%
mutate(hpo_label = factor(hpo_label, levels = hpo_label))
patient_hpo_df$hpo_label = factor(patient_hpo_df$hpo_label, levels = hpo_counts$hpo_label)
p = ggplot(patient_hpo_df %>% mutate(found=T), aes(x=patient_ID, y =hpo_label, fill = found)) +
#geom_tile(col = "white") +
geom_point(aes(colour = found), size = 6) +
guides(fill = F, col=F) +
theme_minimal() + xlab("Patient") + ylab("HPO label") +
coord_equal()
#ggsave("../Figs/heatmap_patient_phenotype.pdf", p)
p
The prioritization was performed using informed propagation, with seed genes specific to each patient.
precomputed_patient_ranked_genes <- "../cache/patient_ranking_results.RDS"
if(!file.exists(precomputed_patient_ranked_genes)){
source("../source/patient_causal_gene_prediction.R")
}
# # load precomputed results
patient_annotated_df <- readRDS(precomputed_patient_ranked_genes) %>%
mutate(Patient = factor(Patient, level = c("N_064", "N_126", "N_020", "N_026", "N_003", "N_129", "N_007",
"N_040"), labels = paste0("P", 1:8)))
Below is the causal gene for each patient, and how well the algorithm performs for each case.
# A tibble: 8 x 5
# Groups: Patient [8]
Patient GENE Diagnostic network_rank total_variants
<fct> <chr> <chr> <dbl> <int>
1 P1 PRDM12 YES 1 46
2 P2 LAMA1 YES 1 33
3 P3 HUWE1 YES 1 9
4 P4 CNTNAP2 YES 1 4
5 P5 CDKL5 YES 2 6
6 P6 LAMA1 YES 4 13
7 P7 BCL11A YES 4 10
8 P8 CHD2 YES 4 7
# ensure it keeps the order
p_rank = list()
for( i in unique(patient_annotated_df$network_set)){
correct_rank <- patient_annotated_df %>%
filter(Diagnostic=="YES", network_set == i) %>%
select(-network_set, -avg) %>%
arrange(network_rank, -total_variants)
p_rank[[i]] <- ggplot(correct_rank) +
geom_segment( aes(x=Patient, xend=Patient, y=0, yend=total_variants), color="grey", size = 4) +
geom_point(aes(x=Patient,y=network_rank), size=5, col="orange", shape = 15) +
geom_text(aes(x=Patient,y=network_rank, label = network_rank), col = "white", fontface = "bold")+
theme_light() +
theme(
panel.grid.major.x = element_blank(),
panel.border = element_blank()
) +
xlab("Patient") +
ylab("# Candidate genes")
#ggsave(p_rank[[i]], filename = paste0("../Figs/patient_gene_ranking_",i,".pdf"), width = 2.5, height = 3)
}
library(patchwork)
p_rank[["signif"]] + ggtitle("Signif") +
p_rank[["all"]] + ggtitle("All networks") +
p_rank[["ppi"]] + ggtitle("PPI only")
AUC_plot <- patient_annotated_df %>%
mutate(pos = (network_rank-1)/(total_variants-1),
label = ifelse(Diagnostic=="NO", 0,1))
pacman::p_load(pROC)
rocvals <- list()
rocvals_df <- list()
for(i in unique(AUC_plot$network_set)){
df <- AUC_plot %>% filter(network_set==i)
response <- df$label
predictor <- df$pos
rocvals[[i]] <- pROC::roc(response, predictor)
rocvals_df[[i]] <- tibble(Sensitivity = rocvals[[i]]$sensitivities, Specificity = rocvals[[i]]$specificities)
}
rocvals_merged_df <- bind_rows(rocvals_df, .id = "rank_type")
palettes <- c('#66c2a5','#fc8d62','#8da0cb')
p <- ggplot(rocvals_merged_df, aes(x = Specificity, y = Sensitivity, col = rank_type)) +
geom_line(size=2) +
scale_x_reverse() +
scale_color_manual(values = palettes) + guides(col = FALSE) +
cowplot::theme_cowplot()
#ggsave(p, filename = "../Figs/patient_gene_ranking_AUC.pdf", width = 3, height = 3)
p
The AUC for each method is as shown below.
# A tibble: 3 x 2
methods auc
<chr> <dbl>
1 all 0.872
2 signif 0.840
3 ppi 0.718
The corresponding p-values are:
# A tibble: 3 x 3
V1 V2 pval
<chr> <chr> <dbl>
1 all signif 0.691
2 all ppi 0.192
3 signif ppi 0.329
pacman::p_load(report)
report_session <- report(sessionInfo())
write_lines(report_session, "report_session.md")
print(report_session)
Analyses were conducted using the R Statistical language (version 3.6.3; R Core Team, 2020) on macOS 10.16, using the packages voronoiTreemap (version 0.2.1; Alexander Kowarik et al., 2021), cowplot (version 1.1.1; Claus Wilke, 2020), igraph (version 1.2.6; Csardi G, Nepusz T: The igraph software package for complex network research, InterJournal, Complex Systems 1695. 2006. https://igraph.org), RColorBrewer (version 1.1.2; Erich Neuwirth, 2014), ggplot2 (version 3.3.3; Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.), stringr (version 1.4.0; Hadley Wickham, 2019), tidyr (version 1.1.2; Hadley Wickham, 2020), forcats (version 0.5.1; Hadley Wickham, 2021), scales (version 1.1.1; Hadley Wickham and Dana Seidel, 2020), readr (version 1.4.0; Hadley Wickham and Jim Hester, 2020), dplyr (version 1.0.4; Hadley Wickham et al., 2021), ggforestplot (version 0.1.0; Ilari Scheinin et al., 2021), rmarkdown (version 2.7; JJ Allaire and Yihui Xie and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang and Richard Iannone, 2021), ggrepel (version 0.9.1; Kamil Slowikowski, 2021), tibble (version 3.1.0; Kirill Müller and Hadley Wickham, 2021), purrr (version 0.3.4; Lionel Henry and Hadley Wickham, 2020), report (version 0.3.0; Makowski et al., 2020), treemap (version 2.4.2; Martijn Tennekes, 2017), ggstatsplot (version 0.7.0; Patil, 2018), pacman (version 0.5.1; Rinker et al., 2017), ggraph (version 2.0.4; Thomas Lin Pedersen, 2020), patchwork (version 1.1.1; Thomas Lin Pedersen, 2020), tidygraph (version 1.2.0; Thomas Lin Pedersen, 2020), MASS (version 7.3.53; Venables et al., 2002), tidyverse (version 1.3.0; Wickham et al., 2019), pROC (version 1.17.0.1; Xavier Robin et al., 2011) and knitr (version 1.31; Yihui Xie, 2021).
References
----------
- Alexander Kowarik, Bernhard Meindl, Malaver Vojvodic, Mike Bostock and Franck Lebeau (2021). voronoiTreemap: Voronoi Treemaps with Added Interactivity by Shiny. R package version 0.2.1. https://github.com/uRosConf/voronoiTreemap
- Claus O. Wilke (2020). cowplot: Streamlined Plot Theme and Plot Annotations for 'ggplot2'. R package version 1.1.1. https://CRAN.R-project.org/package=cowplot
- Csardi G, Nepusz T: The igraph software package for complex network research, InterJournal, Complex Systems 1695. 2006. https://igraph.org
- Erich Neuwirth (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-2. https://CRAN.R-project.org/package=RColorBrewer
- H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
- Hadley Wickham (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0. https://CRAN.R-project.org/package=stringr
- Hadley Wickham (2020). tidyr: Tidy Messy Data. R package version 1.1.2. https://CRAN.R-project.org/package=tidyr
- Hadley Wickham (2021). forcats: Tools for Working with Categorical Variables (Factors). R package version 0.5.1. https://CRAN.R-project.org/package=forcats
- Hadley Wickham and Dana Seidel (2020). scales: Scale Functions for Visualization. R package version 1.1.1. https://CRAN.R-project.org/package=scales
- Hadley Wickham and Jim Hester (2020). readr: Read Rectangular Text Data. R package version 1.4.0. https://CRAN.R-project.org/package=readr
- Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2021). dplyr: A Grammar of Data Manipulation. R package version 1.0.4. https://CRAN.R-project.org/package=dplyr
- Ilari Scheinin, Maria Kalimeri, Vilma Jagerroos, Juuso Parkkinen, Emmi Tikkanen, Peter Würtz and Antti Kangas (2021). ggforestplot: Forestplots of Measures of Effects and Their Confidence Intervals. https://nightingalehealth.github.io/ggforestplot/index.html, https://github.com/nightingalehealth/ggforestplot.
- JJ Allaire and Yihui Xie and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang and Richard Iannone (2021). rmarkdown: Dynamic Documents for R. R package version 2.7. URL https://rmarkdown.rstudio.com.
- Kamil Slowikowski (2021). ggrepel: Automatically Position Non-Overlapping Text Labels with 'ggplot2'. R package version 0.9.1. https://CRAN.R-project.org/package=ggrepel
- Kirill Müller and Hadley Wickham (2021). tibble: Simple Data Frames. R package version 3.1.0. https://CRAN.R-project.org/package=tibble
- Lionel Henry and Hadley Wickham (2020). purrr: Functional Programming Tools. R package version 0.3.4. https://CRAN.R-project.org/package=purrr
- Makowski, D., Ben-Shachar, M.S., Patil, I. & Lüdecke, D. (2020). Automated Results Reporting as a Practical Tool to Improve Reproducibility and Methodological Best Practices Adoption. CRAN. Available from https://github.com/easystats/report. doi: .
- Martijn Tennekes (2017). treemap: Treemap Visualization. R package version 2.4-2. https://CRAN.R-project.org/package=treemap
- Patil, I. (2018). ggstatsplot: 'ggplot2' Based Plots with Statistical Details. CRAN. Retrieved from https://cran.r-project.org/web/packages/ggstatsplot/index.html
- R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
- Rinker, T. W. & Kurkiewicz, D. (2017). pacman: Package Management for R. version 0.5.0. Buffalo, New York. http://github.com/trinker/pacman
- Thomas Lin Pedersen (2020). ggraph: An Implementation of Grammar of Graphics for Graphs and Networks. R package version 2.0.4. https://CRAN.R-project.org/package=ggraph
- Thomas Lin Pedersen (2020). patchwork: The Composer of Plots. R package version 1.1.1. https://CRAN.R-project.org/package=patchwork
- Thomas Lin Pedersen (2020). tidygraph: A Tidy API for Graph Manipulation. R package version 1.2.0. https://CRAN.R-project.org/package=tidygraph
- Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0
- Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
- Xavier Robin, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez and Markus Müller (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, p. 77. DOI: 10.1186/1471-2105-12-77 <http://www.biomedcentral.com/1471-2105/12/77/>
- Yihui Xie (2021). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.31.